In [3]:
import pandas as pd
import numpy as np
In [4]:
products = pd.read_csv('amazon_baby.csv')
In [16]:
products = products.fillna({'review':''}) # fill in N/A's in the review column
In [17]:
import string
def remove_punctuation(text):
    # Python 2 str.translate: delete every punctuation character
    return text.translate(None, string.punctuation)
products['review_clean'] = products['review'].apply(remove_punctuation)
In [18]:
products.head(5)
Out[18]:
In [19]:
products = products[products['rating'] != 3]
Now we assign reviews with a rating of 4 or higher to the positive class and reviews with a rating of 2 or lower to the negative class. In the sentiment column, +1 is the positive class label and -1 is the negative class label. A convenient way to do this is to write an anonymous function that converts a rating into a class label and apply it to every element of the rating column. In pandas, you would use apply():
In [21]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
In [37]:
products.head(3)
Out[37]:
In [43]:
import json
with open('test_data_idx.json') as test_data_file:
    test_data_idx = json.load(test_data_file)
with open('train_data_idx.json') as train_data_file:
    train_data_idx = json.load(train_data_file)
print train_data_idx[:3]
print test_data_idx[:3]
In [46]:
train_data = products.iloc[train_data_idx]
train_data.head(2)
Out[46]:
In [45]:
test_data = products.iloc[test_data_idx]
test_data.head(2)
Out[45]:
We will now compute the word count for each word that appears in the reviews. A vector of word counts is often referred to as bag-of-words features. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to the documentation of your tool of choice for how to produce sparse word count vectors. The general steps for extracting word count vectors are as follows:
In [47]:
from sklearn.feature_extraction.text import CountVectorizer
# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])
#print vectorizer.vocabulary_
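To make the idea of a sparse word count vector concrete, here is a minimal pure-Python sketch on a toy corpus. It is not how CountVectorizer is implemented; the helper names (`tokenize`, `build_vocabulary`, `to_sparse_counts`) and the example reviews are made up for illustration. The sketch reuses the token pattern from the cell above, lowercases the text, and stores only the nonzero counts of each review.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'\b\w+\b')  # same pattern as above: keeps single-letter words

def tokenize(text):
    # Lowercase first, then split into word tokens
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(documents):
    # Assign a column index to every distinct word seen in the training documents
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: column for column, word in enumerate(words)}

def to_sparse_counts(text, vocabulary):
    # Keep only nonzero entries as {column: count}; words outside the vocabulary are ignored
    counts = Counter(tokenize(text))
    return {vocabulary[word]: n for word, n in counts.items() if word in vocabulary}

train_reviews = ['I love it love it', 'a waste of money']  # hypothetical reviews
vocabulary = build_vocabulary(train_reviews)
train_rows = [to_sparse_counts(review, vocabulary) for review in train_reviews]

# A test review is mapped with the vocabulary learned from the training data only
test_row = to_sparse_counts('love the price', vocabulary)
```

The words "the" and "price" never appear in the training reviews, so they have no column and are dropped from `test_row`. This is the same reason the notebook calls fit_transform on the training data but only transform on the test data.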
Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.
In [48]:
from sklearn.linear_model import LogisticRegression
sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])
Out[48]:
There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that a positive weight w_j means the corresponding word pushes the prediction toward positive sentiment, while a negative weight pushes it toward negative sentiment. Calculate the number of positive coefficients, counting those that are >= 0 (strictly speaking, the nonnegative ones).
In [53]:
np.sum(sentiment_model.coef_ >= 0)
Out[53]:
Now that a model is trained, we can make predictions on the test data. In this section, we will explore this in the context of 3 data points in the test data. Take the 11th, 12th, and 13th data points in the test data and save them to sample_test_data. The following cell extracts the three data points from the DataFrame test_data and prints their content:
In [60]:
sample_test_data = test_data.iloc[10:13]
print sample_test_data
In [61]:
sample_test_data.iloc[0]['review']
Out[61]:
In [62]:
sample_test_data.iloc[1]['review']
Out[62]:
In [66]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print scores
print sentiment_model.predict(sample_test_matrix)
These scores can be used to make class predictions as follows: predict +1 when the score is positive and -1 otherwise.
Using scores, write code to calculate predicted labels for sample_test_data.
Checkpoint: Make sure your class predictions match with the ones obtained from sentiment_model. The logistic regression classifier in scikit-learn comes with the predict function for this purpose.
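A minimal sketch of this step, assuming the usual threshold at zero (predict +1 when the score is positive and -1 otherwise). The scores below are made-up values rather than actual output of decision_function, and `labels_from_scores` is a hypothetical helper name.

```python
def labels_from_scores(scores):
    # Predict +1 for a positive score and -1 otherwise (a score of exactly 0 maps to -1)
    return [1 if score > 0 else -1 for score in scores]

example_scores = [5.6, -3.1, -10.4]  # made-up values for illustration
predicted_labels = labels_from_scores(example_scores)
print(predicted_labels)
```

With the NumPy array returned by decision_function, the same rule can be written as np.where(scores > 0, 1, -1), and the result can be compared against sentiment_model.predict(sample_test_matrix).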
In [75]:
print [1./(1+np.exp(-x)) for x in scores]
In [76]:
print sentiment_model.classes_
print sentiment_model.predict_proba(sample_test_matrix)
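The list comprehension in cell In [75] applies the sigmoid (logistic) function to each score, which gives the estimated probability of the +1 class. The sketch below shows the same computation with the standard library only. The `sigmoid` helper and the example scores are illustrative assumptions, and the branch on the sign is one common way to avoid overflow for very negative scores, not something the notebook does.

```python
import math

def sigmoid(score):
    # Branch on the sign so exp() is only ever called on a non-positive value
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)

example_scores = [0.0, 2.0, -2.0, -800.0]  # made-up values for illustration
probabilities = [sigmoid(s) for s in example_scores]
```

predict_proba returns one column per class, in the order given by classes_. With classes [-1, 1], the second column is the one to compare with these sigmoid values.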
We now turn to examining the full test dataset, test_data, and use sklearn.linear_model.LogisticRegression to form predictions on all of the test data points.
Using the sentiment_model, find the 20 reviews in the entire test_data with the highest probability of being classified as a positive review. We refer to these as the "most positive reviews."
To calculate these top-20 reviews, use the following steps:
In [87]:
test_scores = sentiment_model.decision_function(test_matrix)
positive_idx = np.argsort(-test_scores)[:20]
print positive_idx
print test_scores[positive_idx[0]]
test_data.iloc[positive_idx]
Out[87]:
In [86]:
negative_idx = np.argsort(test_scores)[:20]
print negative_idx
print test_scores[negative_idx[0]]
test_data.iloc[negative_idx]
Out[86]:
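The two cells above rank reviews by score rather than by probability. Because the sigmoid is strictly increasing, both orderings agree except where probabilities round to the same value. np.argsort sorts in ascending order, which is why the scores are negated to get the most positive reviews first. The following standard-library sketch shows the same selection on made-up scores; `top_k_indices` is a hypothetical helper, not part of NumPy or scikit-learn.

```python
def top_k_indices(scores, k, largest=True):
    # Sort positions by their score; reverse=True puts the largest scores first
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=largest)
    return order[:k]

example_scores = [0.3, 4.2, -1.5, 2.8, -6.0]  # made-up values for illustration
most_positive = top_k_indices(example_scores, 2)                 # positions of the 2 highest scores
most_negative = top_k_indices(example_scores, 2, largest=False)  # positions of the 2 lowest scores
```

The returned values are row positions, not index labels, which is why the notebook looks the rows up with test_data.iloc rather than test_data.loc.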
We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by
$$ \text{accuracy} = \frac{\#\,\text{correctly classified examples}}{\#\,\text{total examples}} $$
This can be computed as follows:
In [88]:
predicted_y = sentiment_model.predict(test_matrix)
correct_num = np.sum(predicted_y == test_data['sentiment'])
total_num = len(test_data['sentiment'])
print "correct_num: {}, total_num: {}".format(correct_num, total_num)
accuracy = correct_num * 1./ total_num
print accuracy
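Since the same accuracy computation is repeated several times later in this notebook, it can help to wrap it in a small function. The sketch below is one possible version using plain Python lists; the name `accuracy_of` and the example labels are made up for illustration.

```python
def accuracy_of(predicted, actual):
    # Fraction of positions where the predicted label equals the true label
    if len(predicted) != len(actual):
        raise ValueError('predicted and actual must have the same length')
    if len(actual) == 0:
        raise ValueError('cannot compute accuracy of an empty set of examples')
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / float(len(actual))

example_predicted = [1, -1, 1, 1]  # made-up labels for illustration
example_actual = [1, -1, -1, 1]
example_accuracy = accuracy_of(example_predicted, example_actual)
```

scikit-learn also provides sklearn.metrics.accuracy_score, which computes the same quantity.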
In [91]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves',
                     'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed',
                     'work', 'product', 'money', 'would', 'return']
In [92]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])
In [93]:
simple_model = LogisticRegression()
simple_model.fit(train_matrix_word_subset, train_data['sentiment'])
Out[93]:
In [102]:
simple_model_coef_table = pd.DataFrame({'word':significant_words,
'coefficient':simple_model.coef_.flatten()})
#simple_model_coef_table
simple_model_coef_table.sort_values(['coefficient'], ascending=False)
Out[102]:
In [103]:
len(simple_model_coef_table[simple_model_coef_table['coefficient']>0])
Out[103]:
In [118]:
vectorizer_word_subset.get_feature_names()
Out[118]:
In [120]:
train_predicted_y = sentiment_model.predict(train_matrix)
correct_num = np.sum(train_predicted_y == train_data['sentiment'])
total_num = len(train_data['sentiment'])
print "correct_num: {}, total_num: {}".format(correct_num, total_num)
train_accuracy = correct_num * 1./ total_num
print "sentiment_model training accuracy: {}".format(train_accuracy)
train_predicted_y = simple_model.predict(train_matrix_word_subset)
correct_num = np.sum(train_predicted_y == train_data['sentiment'])
total_num = len(train_data['sentiment'])
print "correct_num: {}, total_num: {}".format(correct_num, total_num)
train_accuracy = correct_num * 1./ total_num
print "simple_model training accuracy: {}".format(train_accuracy)
In [122]:
test_predicted_y = sentiment_model.predict(test_matrix)
correct_num = np.sum(test_predicted_y == test_data['sentiment'])
total_num = len(test_data['sentiment'])
print "correct_num: {}, total_num: {}".format(correct_num, total_num)
test_accuracy = correct_num * 1./ total_num
print "sentiment_model test accuracy: {}".format(test_accuracy)
test_predicted_y = simple_model.predict(test_matrix_word_subset)
correct_num = np.sum(test_predicted_y == test_data['sentiment'])
total_num = len(test_data['sentiment'])
print "correct_num: {}, total_num: {}".format(correct_num, total_num)
test_accuracy = correct_num * 1./ total_num
print "simple_model test accuracy: {}".format(test_accuracy)
It is quite common to use the majority class classifier as a baseline (or reference) model for comparison with your classifier. The majority class classifier predicts the majority class for all data points. At the very least, your model should comfortably beat the majority class classifier; otherwise, it is (usually) pointless.
In [124]:
positive_label = len(test_data[test_data['sentiment']>0])
negative_label = len(test_data[test_data['sentiment']<0])
print "positive_label is {}, negative_label is {}".format(positive_label, negative_label)
In [125]:
baseline_accuracy = positive_label*1./(positive_label+negative_label)
print "baseline_accuracy is {}".format(baseline_accuracy)
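The cell above assumes that the positive class is the majority. A slightly more general sketch, which picks the majority label from the data instead of hard-coding it, is shown below. The helper name `majority_class_accuracy` and the example labels are made up for illustration.

```python
from collections import Counter

def majority_class_accuracy(labels):
    # Accuracy obtained by always predicting the most frequent label
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return majority_label, majority_count / float(len(labels))

example_labels = [1, 1, 1, -1, 1, -1]  # made-up labels for illustration
label, baseline = majority_class_accuracy(example_labels)
print("majority label is {}, baseline accuracy is {}".format(label, baseline))
```

Applied to test_data['sentiment'], this should give the same number as baseline_accuracy whenever the positive reviews outnumber the negative ones.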